The corresponding journal entry for this assignment is available on GitHub at this link: https://github.com/bcb420-2022/Jedid_Ahn/wiki/Entry-%239:-Notes-from-lecture-for-A2
# Libraries to install here.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
BiocManager::install("GEOquery")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
BiocManager::install("limma")
}
if (!requireNamespace("circlize", quietly = TRUE)) {
install.packages("circlize")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
if (!requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages("gprofiler2")
}
# Libraries to load here.
library(Biobase)
library(GEOquery)
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gprofiler2)
# Figure count: Start at 1.
count <- 1
The GEO dataset that I chose was GSE66261, which examines the expression and functions of long noncoding RNAs during human T helper cell differentiation.
GEO <- "GSE66261"
count_folder_path <- paste0(getwd(), "/", GEO)
if (!dir.exists(count_folder_path)){
count_file_path <- rownames(GEOquery::getGEOSuppFiles(GEO))[1]
} else{
count_file_path <- list.files(path = count_folder_path, full.names = TRUE)[1]
}
GSE66261 <- read.delim(count_file_path, header = TRUE, check.names = FALSE)
To better understand the data, groups were defined by formatting according to library, cell type, and condition. There are 2 libraries: 2695 and 2960, 2 cell types: Primary and effector, and 3 conditions: TH1, TH2, and TH17.
listed_titles <- colnames(GSE66261)[3:14]
gsm_titles <- c("TH1_Primary_2695", "TH2_Primary_2695", "TH17_Primary_2695", "TH1_Effector_2695", "TH2_Effector_2695", "TH17_Effector_2695", "TH1_Primary_2960", "TH2_Primary_2960", "TH17_Primary_2960", "TH1_Effector_2960", "TH2_Effector_2960", "TH17_Effector_2960")
exp_names <- data.frame(listed_titles, gsm_titles)
samples <- data.frame(lapply(exp_names$gsm_titles,
function(x) { rev(unlist(strsplit(x, split = "_"))) }))
colnames(samples) <- listed_titles
rownames(samples) <- c("library", "cell_type", "condition")
samples <- data.frame(t(samples))
Before normalization, the dataset was filtered by removing genes with low counts. Using edgeR protocol, features without at least 1 read per million in n of the samples were removed. Since there are 6 samples of each group, n = 6.
n <- 6
# Translate out counts into counts per million using the edgeR package.
cpms = edgeR::cpm(GSE66261[, 3:14])
rownames(cpms) <- GSE66261[, 1]
# Get rid of low counts
keep = rowSums(cpms > 1) >= n
GSE66261_filtered = GSE66261[keep, ]
rownames(GSE66261) <- 1:nrow(GSE66261)
The original dataset already came with HUGO gene symbols, so no mapping of symbols was required. In addition, tests showed that nearly 90% of symbols lined up with the symbols obtained from biomaRt.
Tests showed that there were 18 rows in total that consists of mappings to the same symbol. Two rows were updated by replacing their original HUGO gene symbol with the mapped gene symbol, while the rest were dropped to avoid compromising normalization and downstream analysis.
any(GSE66261_filtered$Name == "RNY1P13")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000201900")] <- "RNY1P13"
any(GSE66261_filtered$Name == "STRA6LP")
## [1] FALSE
GSE66261_filtered$Name[which(GSE66261_filtered$Feature == "ENSG00000254876")] <- "STRA6LP"
remove = duplicated(GSE66261_filtered$Name)
GSE66261_filtered <- GSE66261_filtered[!remove, ]
Trimmed means of M-value (TMM) was chosen as the normalization technique as it a specialized normalization technique for RNASeq datasets.
filtered_data_matrix <- as.matrix(GSE66261_filtered[, 3:14])
rownames(filtered_data_matrix) <- GSE66261_filtered$Name
d = edgeR::DGEList(counts = filtered_data_matrix, group = samples$cell_type)
d = edgeR::calcNormFactors(d)
normalized_counts <- as.data.frame(edgeR::cpm(d))
rownames(normalized_counts) <- GSE66261_filtered$Name
Figure 1: We observe that TH1 and TH2 samples cluster all together as they are both involved in immune response, while TH17 samples cluster together separately from TH1 and TH2.
Figure 2: The boxplot shows that there aren’t significant differences in the mean for each sample after normalization.
Figure 3: The soft bimodal distribution is something to keep in mind when performing differential expression analysis.
Figure 4: The BCV plot confirms the trend of more gene expression leading to dispersion values being closer together, and less variation overall.
normalized_counts
Before running differential expression analysis, we have to create a linear model in R by creating a design matrix. We will account for both library variability and the condition of interest, which is contributing to the differential expression. This is demonstrated by the MDS plot shown in Figure 1, where samples are clustering according to condition.
model_design <- model.matrix(~ samples$library + samples$condition)
Fit the normalized counts to the model design created.
expression_mat <- as.matrix(normalized_counts)
minimal_set <- Biobase::ExpressionSet(assayData = expression_mat)
fit <- limma::lmFit(minimal_set, model_design)
Then, we use empirical Bayes to compute differential expression.
fit2 <- limma::eBayes(fit, trend = TRUE)
(Answer to Q2) The Benjamini Hochberg method was used as the paper associated with the experiment specified that false discovery rate was used to correct for multiple testing.
output_hits <- limma::topTable(fit2, coef = ncol(model_design), adjust.method = "BH", number = nrow(expression_mat))
# Then, we sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
(Answer to Q1) 2376 genes are differentially expressed pass the threshold p-value of 0.05. 0.05 was chosen as it is the most commonly used cutoff for significance, which signifies a 95% chance that differential expression would not be observed if the condition had no effect.
length(which(output_hits$P.Value < 0.05))
## [1] 2376
(Answer to Q2) On the other hand, 504 genes pass correction.
length(which(output_hits$adj.P.Val < 0.05))
## [1] 504
Figure 5: Differentially expressed genes are coloured and bolded in red.
Figure 6: We can see that the MA plot shows a high degree of confidence in RAD50 that it is differentially expressed and has high regulation.
(Answer to Q4) Next, we will create a heatmap to visualize upregulation and downregulation of the top hits in the data according to a colour scale. Based on Figure 7, we see that the conditions are clustered together due to the following groupings:
-TA-4 = TH1 and -TA-1 = TH1
-TA-2 = TH2 and -TA-5 = TH2
-TA-6 = TH17 and -TA-3 = TH17
Figure 7: Not only are the conditions clustered together, but their genes show similar patterns of upregulation and downregulation.
(Answer to Q1) Since we are focused on threshold calculation rather than rank calculation, options included DAVID, EnrichR, and g:Profiler. g:Profiler was used due to familiarity and because it is a popular tool for functional enrichment analysis and visualization of gene lists. g:Profiler was used through the gprofiler2 R package rather than through their web browser.
I will start by creating a thresholded list of genes, by calculating their rank and then retrieving upregulated and downregulated genes that are significant (pval < 0.05).
output_hits[, "rank"] <- -log(output_hits$P.Value, base = 10) * sign(output_hits$logFC)
output_hits <- output_hits[order(output_hits$rank), ]
upregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 &
output_hits$logFC > 0)]
downregulated_genes <- rownames(output_hits)[which(output_hits$P.Value < 0.05 &
output_hits$logFC < 0)]
all_genes <- rownames(output_hits)
To present the results, I will store them in tables.
dir.create(paste0(getwd(), "/data"))
write.table(x = upregulated_genes,
file = file.path("data", "GSE66261_upregulated_genes.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x = downregulated_genes,
file=file.path("data","GSE66261_downregulated_genes.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x = data.frame(genename = rownames(output_hits), F_stat = output_hits$rank),
file = file.path("data", "GSE66261_ranked_genelist.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
There are 1035 upregulated genes that are significant.
length(upregulated_genes)
## [1] 1035
On the other hand, there are 1341 upregulated genes that are significant.
length(downregulated_genes)
## [1] 1341
(Answer to Q2) To maintain consistency with the differential expression analysis, Benjamini Hochberg FDR was used as the correction method for multiple hypothesis testing. Unfortunately, the corresponding paper for GSE66261 did not specify which annotation sources that they used for gene set enrichment.
GO:BP was used as gene ontology is a popular source for identifying relevant biological processes. In addition, KEGG, Reactome (REAC), and Wiki Pathways (WP) were all included as parameters due to a specific interest in learning the biological pathways associated with the genes of interest. Version 0.2.1 of the R package gprofiler2 was used, which is presumed to use the same version as the web browser tool. The web g:Profiler version is e105_eg52_p16_e84549, which was last updated on 03/01/2022.
gost_res <- gprofiler2::gost(all_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
(Answer to Q3) A p-value threshold of 0.05 was used as previously done for differential expression analysis. 3862 genesets were returned in total, with nearly 75% (2861 out of 3862) coming from GO:BP.
nrow(gost_res$result)
## [1] 3862
table(gost_res$result$source)
##
## GO:BP KEGG REAC WP
## 2861 152 658 191
(Answer to Q3) If the p-value threshold is set to a more stringent value of 0.01, 3026 genesets are returned instead.
gost_res_stringent <- gprofiler2::gost(all_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.01,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
nrow(gost_res_stringent$result)
## [1] 3026
table(gost_res_stringent$result$source)
##
## GO:BP KEGG REAC WP
## 2208 122 559 137
Observing the plot function that gprofiler2 provides as shown in Figure 8, the term names whose -log10(p-adj) value is greater than 16 are the most significant and the ones that we are most interested in.
Figure 8: Despite the visualization, it is hard to gauge the number of term names that we are most interested in in due to the capping of values whose -log10(p-adj) > 16.
(Answer to Q4) Next, we will run the g:Profiler analysis against the upregulated and downregulated genes both, with a threshold of 0.05.
gost_res_up <- gprofiler2::gost(upregulated_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
gost_res_down <- gprofiler2::gost(downregulated_genes, organism = "hsapiens",
correction_method = c("fdr"),
user_threshold = 0.05,
sources = c("GO:BP", "KEGG", "REAC", "WP"))
(Answer to Q4) Using all genes, we see that the term names that are most enriched are involved in metabolic and biosynthetic processes.
knitr::kable(gost_res$result$term_name[1:10])
| x |
|---|
| cellular macromolecule metabolic process |
| organic substance biosynthetic process |
| organonitrogen compound metabolic process |
| biosynthetic process |
| cellular biosynthetic process |
| cellular protein metabolic process |
| regulation of cellular metabolic process |
| protein metabolic process |
| macromolecule biosynthetic process |
| regulation of primary metabolic process |
(Answer to Q4) Using only upregulated genes that are significant, we see that the term names that are most enriched are still involved in metabolic and biosynthetic processes.
knitr::kable(gost_res_up$result$term_name[1:10])
| x |
|---|
| regulation of cellular process |
| biological regulation |
| small molecule metabolic process |
| biosynthetic process |
| oxoacid metabolic process |
| organic acid metabolic process |
| organic substance biosynthetic process |
| carboxylic acid metabolic process |
| regulation of biological process |
| regulation of cellular metabolic process |
(Answer to Q4) On the other hand, enrichment of downregulated genes that are significant show term names that are involved in immune response and regulation as well as response to stimuli. Based on the results using all genes, this shows that upregulated genes dominate the landscape by contributing more to differential expression and downstream pathways.
knitr::kable(gost_res_down$result$term_name[1:10])
| x |
|---|
| immune system process |
| regulation of response to stimulus |
| response to stimulus |
| regulation of cellular process |
| positive regulation of biological process |
| immune response |
| cellular response to stimulus |
| biological process involved in interspecies interaction between organisms |
| intracellular signal transduction |
| response to stress |
Answer: Yes, they do. The original paper states that long non-coding RNAs (lncRNAs) play essential roles in arrays of cellular processes, which is consistent with the over-representation results of all genes and upregulated genes. The paper also states that differentiation of T helper cells is a critical step to adaptive immune response to pathogens, which is consistent with the over-representation results of downregulated genes.
Answer: Yes, the paper “Long Non-coding RNAs: Major Regulators of Cell Stress in Cancer” by Connerty et al. explains that lncRNAs have diverse roles in cellular processes such as acting as epigenetic modulators and promoting or inhibiting transcription. On the other hand, the textbook “Molecular Biology of the Cell. 4th edition.” by Alberts et al. states that helper T cells are required for almost all adaptive immune responses.